Machine learning-derived gut microbiome signature predicts fatty liver disease in the presence of insulin resistance

A simple predictive biomarker for fatty liver disease is required for individuals with insulin resistance. Here, we developed a supervised machine learning-based classifier for fatty liver disease using fecal 16S rDNA sequencing data. Based on the Kangbuk Samsung Hospital cohort (n = 777), we generated a random forest classifier to predict fatty liver diseases in individuals with or without insulin resistance (n = 166 and n = 611, respectively). The model performance was evaluated based on metrics, including accuracy, area under receiver operating curve (AUROC), kappa, and F1-score. The developed classifier for fatty liver diseases performed better in individuals with insulin resistance (AUROC = 0.77). We further optimized the classifiers using genetic algorithm. The improved classifier for insulin resistance, consisting of ten microbial genera, presented an advanced classification (AUROC = 0.93), whereas the improved classifier for insulin-sensitive individuals failed to distinguish participants with fatty liver diseases from the healthy. The classifier for individuals with insulin resistance was comparable or superior to previous methods predicting fatty liver diseases (accuracy = 0.83, kappa = 0.50, F1-score = 0.89), such as the fatty liver index. We identified the ten genera as a core set from the human gut microbiome, which could be a diagnostic biomarker of fatty liver diseases for insulin resistant individuals. Collectively, these findings indicate that the machine learning classifier for fatty liver diseases in the presence of insulin resistance is comparable or superior to commonly used methods.

contributor to T2DM and metabolic syndrome, is a potential driver of NAFLD progression in nonalcoholic steatohepatitis (NASH) 4,5 . IR is a common pathological factor of NASH and T2DM. It may also be involved in hepatocellular carcinoma (HCC) development 6 .
Liver biopsy or magnetic resonance imaging/spectroscopy (MRI/MRS) is the gold standard for diagnosing FL 7 . However, it has certain disadvantages. For instance, liver biopsy is invasive, and the examination is spatially limited to the sample. MRI/MRS is inaccessible and expensive. The most popular method for FL diagnosis is ultrasonography (USG). However, it also has several limitations: (i) variability depending on the spots observed, (ii) low sensitivity in mild hepatic steatosis occupying < 30% of the liver 8 ; and (iii) inconsistencies in diagnosis between interpreters. Therefore, an additional non-invasive diagnostic biomarker for FL is required urgently.
The symbiosis or dysbiosis between the human gut microbiome (HGM) and hosts is linked to the health or disease of the host, respectively 9 . According to recent reports, alterations in the gut microbiome could be a driver of obesity and IR [10][11][12] . It has also been shown that specific signatures of the HGM can serve as biomarkers of liver diseases, including NASH, liver fibrosis, and cirrhosis [13][14][15] . However, only a few studies have proposed the HGM as a biomarker for general FL rather than for advanced liver diseases. Most studies investigated either the correlation or causality of HGM with liver diseases only in a pathological cohort rather than a healthy one. Furthermore, those studies did not invest in participants with IR, who have a higher risk for advanced FL disease, in the healthy cohort. Therefore, our study established a potential gut microbiome to identify the FL of participants with IR in a healthy cohort using supervised machine learning (ML) methods for classification, such as random forest (RF), gradient boosting machine (GBM), extreme gradient boosting (XGB) algorithm, along with genetic algorithm (GA), a random-based algorithm inspired by natural selection in biology to obtain the optimized solution 16 .

Materials and methods
Human participants and data collection. Stool samples were collected from the study participants (n = 1,463) from the routine annual comprehensive physical examination of the Kangbuk Samsung Health (KSH) cohort. The study participants underwent extensive periodic PE between June and September 2014 17 and 213 participants were excluded from 1,463 participants because of missing data and poor detection. FL was diagnosed using abdominal USG with a 3.5-MHz transducer based on conventionally captured images by trained radiologists who were blinded to the study's predetermined parameters as previously described 18,19 . In the diagnosis, the inter-observer reliability value was Cohen's kappa coefficient of 0.74, and the intra-observer reliability value was 0.94 20 .
The Institutional Review Board of Kangbuk Samsung Hospital authorized the study's protocol (2019-05-015). All participants signed a written informed consent form after being informed of possible outcomes and the nature of the study. In the study, we obeyed all applicable regulations of institutions and governments regarding human research ethics for participants, following the guidelines of the Declaration of Helsinki 21 .
DNA purification and 16S rDNA gene sequencing. The Illumina MiSeq platform was used to sequence the fecal DNA samples, following the provided protocol (Illumina, San Diego, CA, USA) 22 . The DADA2 plugin of the QIIME 2 package (v.2020.8) was utilized in filtering out chimeras and low-quality sequences and to produce amplicon sequence variants (ASVs) 23,24 . The naïve Bayes classifier were trained, and the classifier was used to assign ASVs to microbial taxonomy against the SILVA 132 with a 99% operational taxonomy unit dataset. All 16S rDNA gene sequencing files are available in the Clinical & Omics Data Archive of the Korea National Institute of Health (accession number: R000635). Development of ML classifier and evaluation. R Package "caret" v.6.0-86. was used for ML approach using three ML algorithms (RF, GBM, and XGB) 25 . The RF parameter options were set to default option in the ML approach. In the GBM models' hyperparameter settings, 10,20,30,40, and 50 were used as "n.trees"; one, two, three, and four as "interaction.depth"; 0.01 and 0.001 as "shrinkage"; three, five, seven, and nine as "n.minobsinnode. " For the hyperparameter setting in the XGB models, 10, 20, 30, 40, and 50 was used as "nrounds"; three, five, seven, and nine as "max_depth"; 0.01 and 0.2 as "eta"; 0.01 as "gamma"; 0.75 as "colsample_ by_tree"; 1 as "min_child_weight"; and 0.5 as "subsample. " The microbiome dataset was randomly partitioned into training (80%) and test (20%) datasets using the createDataPartition function. The dataset was preprocessed using the zv, scale, and center methods of the training function. The Synthetic Minority Over-sampling Technique (SMOTE) function in the R package "smotefamily" (v.1.3.1) was used to handle the sample imbalance issue. The tenfold three-times repeated cross-validation were applied to the training dataset for ML-based development into the previously described classification model 13 . The sequential feature selection was conducted based on the Gini importance of the features. The performance of the developed classifiers was assessed using the area under the receiver operating curve (AUROC), representing their sensitivity and specificity in the training dataset. Using the test dataset, the classifiers were evaluated using AUROC, accuracy, F1-score, and kappa.
The optimal feature selection by GA. For a GA-based optimal feature selection, 300 individuals were randomly generated as the initial population to be sequentially evolved further by GA. The individuals carry a specific number of genera (described as "genes") randomly selected from 87 gut microbial genera detected in the fecal samples. The selected genera were encoded as one, and the other genera were encoded as 0 in individuals to be evolved by GA. Each individual in the population was evaluated using a fitness score as follows: www.nature.com/scientificreports/ where S k is the AUROC score from the RF model in the k-th fold during M fold cross-validation; x is the number of genera selected by the RF; W is a penalty weight; and b ∈ {6, 7, 8, 9, 10} is the optimal number of biomarker genera. M was set to 3 and W to 10. According to the fitness score, GA repeatedly searches for the best solution for classifying every generation. Firstly, in the initial population, GA selected the fittest individual with the highest fitness score in the initial population (first generation). To generate the population of the next generation, the fittest individual of the previous generation is kept, while the other individuals of the previous generation are influenced by crossovers and mutations, resulting in different individuals. Thus, we obtained 300 individuals in each generation. Then, these steps were iterated 100 times (generations) to get the best solution through the entire generation.
Package "DEAP" v.1.3.1 under python 3.7.1 was used for the GA simulations, revealing the optimal genera having the best fitness. The optimal genera served as the features of an optimal RF classification model. The crossover rate, mutation rate, and generations for GA simulations were set at 0.8, 0.003, and 300, respectively. Data visualization and statistical analyses. All statistical analysis and visualizations were conducted using RStudio with R v.4.1. The normality of the overall data was analyzed using Shapiro's test. Statistical significance was computed with either two-tailed Wilcoxon's test or Kruskal-Wallis test upon the normality and distribution, and a p-value < 0.05 was deemed statistically significant. The R package "ggplot2" and "pROC" were used for data visualization.
Evaluation of model performance. To evaluate the model performance, we obtained a 2 × 2 confusion matrix from each classifier using the test dataset and calculated true positive, true negative, false positive, and false negative using predicted and observed classes. Then, we calculated and adopted four metrics for the model's performance evaluation: accuracy 26,27 , F1-score 27 , kappa index 28 , and AUROC, plotted using true positive rate and false positive rate (FPR) 29 .

Results
Data processing in the healthy and FL cohort. Physical examination data, including high throughput 16S rDNA sequencing results from 1463 participants of KSH cohorts, were collected for the study. After removing missing and poorly detected values, 1,250 participants were included in the analysis. Subsequently, participants whose ASV number were < 5 000 were filtered out, leaving only 777 participants for the study (Fig. 1). It was revealed that 290 of the 777 participants had FL, while the remaining 487 did not. We regarded participants as insulin resistant or sensitive if their values for homeostatic model assessment of IR (HOMA-IR) were over the following cutoff: 1.8 for men and 2.2 for women, calculated as the critical threshold for T2DM development based on the KSH cohort internal investigation (data not shown). Among the 777 participants, 611 were classified into the insulin-sensitive (IS) group, while 166 were classified into the IR group based on the criteria for insulin resistance. The biological and physical characteristics of these groups are described in Table 1.

Subject demographics.
Among the 777 participants in the study, IS and IR included 611 and 166 individuals, respectively. Men accounted for 55.97% of IS and 84.94% of the IR group. The IS group had significantly lower values than those of the IR group for age, body mass index (BMI), waist circumference, heart rate, HOMA-IR, glucose, insulin, HbA1c, albumin, aspartate aminotransferase, alanine transaminase, triglycerides (TG), low-density lipoprotein cholesterol (LDL-C), and both diastolic and systolic blood pressure (BP). The IS group had lower total cholesterol than the IR group, but the difference was not statistically significant. Moreover, participants of IS group had higher high-density lipoprotein cholesterol (HDL-C) levels than those of the IR group with statistical significance (Table 1).
As NF and FL showed differential alpha diversity but not beta diversity, we generated a classification model with informative gut microbial features. We can perform Gini importance-based core informative feature selection based on the algorithm. The predictive power of the models from the training dataset with two, four, eight, 12, 16, 24, and 32 features were 0.53, 0.59, 0.60, 0.59, 0.62, and 0.64 of AUROC, respectively (Fig. 2e). The FL prediction using the model featuring 32 gut microbial features displayed 0.65 (0.56-0.73) of AUROC in the test dataset (Fig. 2f). This implies an inefficient classification based on a set of most informative features between the NF and FL groups. Microbiome comparison and classification between IRNF and IRFL. NAFLD is considered the hepatic component of IR. Therefore, it is critical to distinguish FL from NF in participants with IR. The participants in the IR groups were divided into the following based on the presence of FL: NF featuring IR (IRNF) and FL featuring IR (IRFL), to find the most informative microbial features differentiating FL from NF in the participants with IR. Then, we observed the differential biodiversity of the two microbiomes. IRNF (median = 6.808, IQR = 6.289-7.183) had a significantly higher value of the index than that of IRFL (median = 6.403, IQR = 5.988-6.805; p = 0.032) in terms of Shannon's entropy (Fig. 3a) Fig. 3c). However, PCoA and uniform manifold www.nature.com/scientificreports/ approximation and projection (UMAP) of the total microbiome of both groups showed no difference in clusters between IRFL and IRNF (Fig. 3d, e, Supplementary Fig. 2a and b).
To classify IRFL and IRNF from their gut microbiome, we constructed ML models using three ML algorithms for classification, RF, GBM, and XGB. Among the three ML models featuring different numbers of gut microbial genera, the RF model demonstrated the most reliable prediction in the test dataset (AUROC 0.77), while the AUROCs of classification for the other two ML models, GBM and XGB, were 0.62 and 0.63, respectively (Fig. 3f). Next, we built the models in the same manner, but individually for each gender, to see if any gender had better predictive results. In the training dataset, the RF model for females displayed AUROC values of 0.81, 0.96, 0.88, and 0.73 for the models using six-, eight-, twelve-feature, and entire gut microbiome, respectively ( Supplementary Fig. 2c). The predictive power of the eight-feature model showed 0.67 AUROC. Surprisingly, the aforementioned outcome in the training dataset was superior to the results from the RF model for male, presenting AUROC values of 0.63 (six-feature model), 0.76 (eight-feature model), 0.69 (twelve-feature model), and 0.58 (entire gut microbiome-based model) (Supplementary Fig. 2d). During model validation using the male test dataset, the RF model had an AUROC of 0.76, the GBM model had an AUROC of 0.62, and the XGB model had an AUROC of 0.77 (Supplementary Fig. 2e). Together, it was determined that the models using the RF algorithm are appropriate for further research.
Then, we built RF models and assessed their efficacy in predicting FL in the IR groups after applying the SMOTE algorithm to the dataset to minimize the present class imbalance (30% IRNF: 70% IRFL). The model's predictive power was 0.87 AUROC in the training dataset, but it only displayed 0.72 AUROC in the test dataset (Supplementary Table S1).

Classification between IRNF and IRFL by using GA-optimized classifier (IRFL-GARF classifier).
GA is a heuristic algorithm that determines the global optimum based on natural selection [30][31][32] . It can be used to select model features such that the model demonstrates the best prediction. We used GA to create an ML classifier with better prediction performance using the RF algorithm. We developed an RF classifier presenting higher accuracy in distinguishing IRFL from IRNF, based on the features selected by GA. The RF classifier optimized by GA was termed "IRFL-GARF classifier, " with the potential gut microbial biomarkers 33 . Using the fitness score, the classifier can repeatedly search for the best solution for classifying IRFL and IRNF every generation.
Also, we developed a GA-optimized classifier for IS (a classification between IS participants without FL, ISNF, and IS participants with FL, ISFL). The model featured eight gut microbial genera, namely, Eubacterium spp.

Model evaluation.
To assess the GA-optimized model's performance in IR, the model's predictive power was compared with previously and broadly used non-invasive indexing scores calculated from clinical data for predicting FL, including FL index (FLI) 34 , NAFLD liver fat score (NAFLD-LFS) 35 , hepatic steatosis index (HSI) 36 , and Framingham steatosis index (FSI) 37 . For comparison, each score was calculated for each IR analyzed for the study and used for FL prediction with a partitioned test dataset. Our classifier displayed 0.93 AUROC, as the FLI, NAFLD-LFS, HSI, and FSI values were 0.82, 0.62, 0.80, and 0.82, respectively (Fig. 6a). The prediction accuracies of the GA-optimized classifier, FLI, NAFLD-LFS, HSI, and FSI were 0.83, 0.57, 0.60, 0.67, and 0.84, respectively (Fig. 6b). Additionally, the FL prediction by our classifier presented a kappa of 0.50, while the kappa of FLI, NAFLD-LFS, HSI, and FSI were 0.24, 0.17, 0.33, and 0.53, respectively (Fig. 6c). Finally, our classifier displayed 0.89 F1-score, which was similar to the FSI (0.90), whereas FLI, NAFLD-LFS, and HSI displayed 0.63, 0.63, and 0.72 of F1-scores, respectively (Fig. 6d). As shown above, among all measuring methods for predicting the power of predictors, our classifier gave the highest diagnostic accuracy compared with other predictors. This result implied that our gut microbiome-based classifier could be used with the abovementioned established predictors.
GA is an adaptive metaheuristic search algorithm that identifies the global optimum based on the principle of natural selection in evolution. GA does not evaluate solutions individually but evaluates a group of solutions simultaneously and explores the space of possible solutions. Furthermore, GA has the advantage of being less likely to fall into a local minimum and does not require assumptions about the interaction between features.
The ten genera could be developed into a non-invasive biomarker for FL disease in insulin-resistant participants, who could have a higher chance of developing advanced chronic liver diseases. To the best of our knowledge, this is the first study that used GA to successfully classify FL diseases, encouraging GA application in future studies. The development of a non-invasive, inexpensive, and accurate method for diagnosing FL disease is required. Several recent studies have proposed the gut microbiome as a potential biomarker for advanced chronic liver diseases [13][14][15]38 ; however, only a few studies have been conducted in generally healthy populations to identify individuals at higher risk of developing advanced chronic liver diseases based on the gut microbiome. For instance, it would be important to identify generally healthy participants showing IR without symptoms with a higher risk of advanced FL diseases.
In this study, alpha diversity, which reflects the gut microbiome structure concerning its richness 39 , decreased significantly in the FL participants. In contrast, the PCoA plots, representing beta diversity of the gut microbiome, failed to generate two distinct groups, inconsistent with previous studies 13,38 . Estimating alpha and beta diversities implies that reduced richness was sufficient to show differences in alpha diversity between the groups. However, it occurred in only a few genera (components). Although it is insufficient to determine whether the altered genera drove FL or vice versa, several studies have reported that the family Christensenellaceae correlates with BMI 40 , and the Ruminococcaceae (R-7 group) genera correlate with blood TG, very-low-density lipoproteinand HDL-particles levels 41 . Another recent human study observed a strong correlation between Collinsella spp. and NASH and cirrhosis 42 . Few studies have indicated that Butyricimonas spp. is altered in AFLD and HCC 43,44 , implying that Butyricimonas spp. might contribute to FL disease or hepatic inflammation.
There are a few limitations to our study. Firstly, our research was based on a Korean hospital cohort with not quite large patients; thus, our results could be racially and geographically biased. However, we tried to prove the feasibility of the discovered ten genera as a biomarker to identify FL disease among patients with IR from the supporting studies. Secondly, it was implied that predicting FL using gut microbiome-based ML in the female IR groups could be more reliable rather than in the male group. However, greater sample size is needed for further validation. Additionally, to our knowledge, there was no independent external validation cohort available; thus, we expect that further validation studies population will address these limitations. Although our study Figure 4. The overview of biomarker genera mining using GA. From the randomly generated initial 300 individuals consisting of genera, the classification model was optimized using GA methods, including crossover and mutation. The model with the highest fitness score is selected for each generation and further sequentially optimized in the next generation. The final model was evaluated using the average AUROC of the tenfold CV model. Further model validation was conducted using test data for the corresponding biomarker subset and accuracy, an F1-score, a kappa, and an AUROC. www.nature.com/scientificreports/ www.nature.com/scientificreports/ is primarily correlative, our data strongly support the value of future work exploring the causal role of the ten genera in liver diseases. Conclusively, these findings indicate that the ML classifier combined with GA for FL in the presence of IR is comparable or superior to commonly used methods. The ten genera we discovered are useful as a non-invasive biomarker for FL among patients with IR. Figure 6. Evaluating prediction using GA-optimized classifier differentiating IRFL from IRNF. Bar plots comparing the predictive power derived from the test dataset using the GA-optimized classifier with other predictors by (A) AUROC, (B) accuracy, (C) kappa, and (D) F1 score. IRFL: fatty liver participants featuring insulin resistance; IRNF: nonfatty liver control group featuring insulin resistance.